A Multi-strategy Approach for Catalog Integration

Authors

  • Ryutaro Ichise
  • Masahiro Hamasaki
  • Hideaki Takeda
Abstract

When we have a large amount of information, we usually organize it using hierarchical categories, to which all information is assigned. This paper proposes a new method of integrating two catalogs with hierarchical categories. The proposed method uses not only the contents of information but also the structures of both category hierarchies. We conducted experiments using two actual Internet directories, and the results show improved performance compared with the previous approach.

In this paper, we introduce a novel approach to the catalog integration problem. The problem addressed is finding an appropriate category Ct in the target catalog TC for each information instance Isi in the source catalog SC. To solve this problem, we proposed Similarity-based integration (SBI) [3]. SBI performs better than the Naive Bayes (NB) approach, even with the extension proposed by [1]. In this paper, we propose a method that combines the SBI approach and the NB approach.

To handle the meaning of information as well, we propose using NB after SBI. A problem with SBI is that it is hard to learn a mapping rule when the destination category is low in the target concept hierarchy. In other words, the learned rules are likely to assign relatively general categories in the target catalog. To avoid this type of rule, we propose applying a contents-based classification method after the SBI algorithm. Since NB is very popular and easy to use, we adopt it as the contents-based classification method. To apply the NB algorithm to hierarchical classification, we use the simple method of the Pachinko Machine NB. The Pachinko Machine classifies instances at internal nodes of the tree and greedily selects sub-branches until it reaches a leaf [4].
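The greedy descent of the Pachinko Machine can be illustrated with a minimal sketch. This is not the authors' implementation: the class `NaiveBayes`, the functions `train_pachinko` and `pachinko_classify`, the toy category tree, and the assumption that training documents sit only at leaf categories are all hypothetical choices made for illustration. Each internal node gets its own multinomial NB classifier (with add-one smoothing) over its child branches, and classification follows the best-scoring branch until a leaf is reached.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for words, label in zip(docs, labels):
            self.doc_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_pachinko(children, docs_by_leaf):
    """Train one NB model per internal node (assumption: documents live at leaves).

    Each child branch is a class whose training documents are all
    documents found anywhere below that child.
    """
    def docs_under(node):
        if node not in children:
            return docs_by_leaf.get(node, [])
        return [d for child in children[node] for d in docs_under(child)]

    models = {}
    for node, kids in children.items():
        docs, labels = [], []
        for kid in kids:
            for d in docs_under(kid):
                docs.append(d)
                labels.append(kid)
        models[node] = NaiveBayes().fit(docs, labels)
    return models


def pachinko_classify(words, start, children, models):
    """Greedily descend from `start`, picking one sub-branch per level, until a leaf."""
    node = start
    while node in children:
        node = models[node].predict(words)
    return node


# Hypothetical toy hierarchy and documents (bags of words).
children = {
    "Photography": ["Techniques", "Equipment"],
    "Equipment": ["Cameras", "Lenses"],
}
docs_by_leaf = {
    "Techniques": [["portrait", "lighting", "tips"], ["landscape", "exposure", "tips"]],
    "Cameras": [["digital", "camera", "body"], ["film", "camera", "review"]],
    "Lenses": [["zoom", "lens", "review"], ["macro", "lens", "glass"]],
}
models = train_pachinko(children, docs_by_leaf)
print(pachinko_classify(["macro", "lens"], "Photography", children, models))
print(pachinko_classify(["film", "camera"], "Equipment", children, models))
```

The `start` argument is where a combined method would plug in: instead of always beginning at the root, the descent can begin at whatever category an earlier step (here, a rule induced by SBI) has selected, as in the second call above.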

The Pachinko Machine NB is applied after the rule induced by SBI decides its starting category.

To evaluate the proposed algorithm, we conducted experiments using real Internet directories collected from Yahoo! [5] and Google [2]. The data was collected from December 2003 to January 2004. The location used in both Yahoo! and Google is Photography. We conducted ten-fold cross-validation on the links that appeared in both directories. The shared links were divided into ten data sets; nine were used to construct rules, and the remaining set was used for testing. Ten experiments were conducted for each data set, and the average accuracy is shown in Figure 1. The accuracy is measured for each depth of the Internet directories.

[Fig. 1. Experimental Results]

C. Zhang, H.W. Guesgen, W.K. Yeap (Eds.): PRICAI 2004, LNAI 3157, pp. 944–945, 2004. © Springer-Verlag Berlin Heidelberg 2004

The vertical axes in Figure 1 show the accuracy, and the horizontal axes show the depth of the concept hierarchies. The left side of Figure 1 shows the results obtained using Google as the source catalog and Yahoo! as the target catalog, and the right side shows the results obtained using Yahoo! as the source catalog and Google as the target catalog. For comparison, the graphs also include the results of SBI. SBI-NB denotes the results of the method proposed in this paper. The proposed algorithm is much more accurate than the original SBI. One reason for this is that NB works well; in other words, contents-based classification is suited to this domain. According to [3], the NB method alone does not achieve the performance of SBI in the Photography domain. However, our proposed algorithm effectively combines the contents-based method with the category similarity-based method.

In this paper, we proposed a new technique for integrating multiple catalogs.
The proposed method uses not only the similarity of the categorization of catalogs but also the contents of information instances. We tested the proposed method on actual Internet directories, and the results show that it is more accurate than the original SBI in these experiments.
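The evaluation protocol described in the experiments (ten-fold cross-validation over shared links, with accuracy reported per depth of the directory) can be sketched as follows. This is an illustrative reading of the protocol, not the authors' code. Several details are assumptions: categories are written as slash-separated paths, an instance's depth is taken to be the depth of its correct target category, a prediction counts as correct only on an exact match, and per-depth accuracy is averaged over the folds that contain that depth. The function names and the toy trainer `train_majority_mapping` are hypothetical stand-ins for a real integration method.

```python
from collections import Counter, defaultdict


def depth(path):
    """Depth of a category given as a slash-separated path (assumed format)."""
    return len(path.strip("/").split("/"))


def cross_validate_by_depth(instances, train, k=10):
    """Return {depth: mean accuracy over folds}.

    instances: list of (source_category, target_category) pairs.
    train: callable that takes training pairs and returns predict(source).
    """
    folds = [instances[i::k] for i in range(k)]
    per_depth = defaultdict(list)
    for i, test in enumerate(folds):
        # Train on the other k - 1 folds, test on the held-out fold.
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        predict = train(training)
        correct, total = Counter(), Counter()
        for source, target in test:
            d = depth(target)
            total[d] += 1
            correct[d] += int(predict(source) == target)
        for d in total:
            per_depth[d].append(correct[d] / total[d])
    return {d: sum(accs) / len(accs) for d, accs in sorted(per_depth.items())}


def train_majority_mapping(training):
    """Toy trainer: map each source category to its most frequent target."""
    votes = defaultdict(Counter)
    for source, target in training:
        votes[source][target] += 1

    def predict(source):
        if source not in votes:
            return None
        return votes[source].most_common(1)[0][0]

    return predict


# Hypothetical shared links, represented only by their category pair.
instances = (
    [("Arts/Photo/Cameras", "Photography/Equipment/Cameras")] * 10
    + [("Arts/Photo", "Photography")] * 10
    + [("Arts/Photo", "Photography/Equipment")]  # one link filed deeper in the target
)
result = cross_validate_by_depth(instances, train_majority_mapping)
print(result)
```

In this toy data the single link whose correct target lies one level deeper is mapped to the more general category, which mirrors the weakness of rule-based mapping noted in the text: rules tend to assign relatively general categories in the target catalog.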


Similar resources

A Modified Multi Time Step Integration for Dynamic Analysis

In this paper, a new implicit higher order accuracy (N-IHOA) time integration method, based on the assumption of a constant time step, is presented for dynamic analysis. This method belongs to the category of multi time step integrations. Here, the current displacement and velocity are assumed to be functions of the velocities and accelerations of several previous time steps, respectively. This definition causes...


Hierarchical Web Catalog Integration with Conceptual Relationships in a Thesaurus

Web catalog integration has become an integral aspect of current digital content management for Internet and e-commerce environments. The Web catalog integration problem concerns integration of documents in a source catalog into a destination catalog. Many investigations have focused on flattened (one-dimensional) catalogs, but few works address hierarchical Web catalog integration. This study ...


Multi Objective Scheduling of Utility-scale Energy Storages and Demand Response Programs Portfolio for Grid Integration of Wind Power

Increasing the penetration of variable wind generation in power systems has created some new challenges in the power system operation. In such a situation, the inclusion of flexible resources which have the potential of facilitating wind power integration is necessary. Demand response (DR) programs and emerging utility-scale energy storages (ESs) are known as two powerful flexible tools that ca...


The Impact of Manufacturing Flexibility Factors on Business Performance in Manufacturing Firms of the Automotive and Tile and Ceramic Industries: A Chain-Level Survey Study

This study examines the impact of manufacturing flexibility factors (integration, supplier management strategy, and supplier selection strategy) on business performance. For this purpose, the relationship between integration (corporate strategy, technology, customer, and supplier integration), supplier management strategy (supplier early involvement, quality roadmap, and technology roadmap), an...


Simulation and Evaluation of Urban Development Scenarios Using Integration of Cellular Automata Model and Game Theory

Urban growth is a dynamic and evolutionary spatial and social process that relates to changes in urban spatial units, the transformation of people's lifestyles, and the resulting demographic changes. Considering the urban development process as a function of land-use interactions, population structure and the strategic behavior of the agents involved in the urban development process (the ...


A New Nonlinear Multi-objective Redundancy Allocation Model with the Choice of Redundancy Strategy Solved by the Compromise Programming Approach

One of the primary concerns in any system design problem is to prepare a highly reliable system with minimum cost. One way to increase the reliability of systems is to use redundancy in different forms such as active or standby. In this paper, a new nonlinear multi-objective integer programming model with the choice of redundancy strategy and component type is developed where standby strategy ...



Publication date: 2004